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Abstract 

In the study of the conformational behavior of complex systems, such 
as proteins, several related statistical measures are commonly used to com- 
pare two different potential energy functions. Among them, the Pearson's 
correlation coefficient r has no units and allows only semi-quantitative 
statements to be made. Those that do have units of energy and whose 
value may be compared to a physically relevant scale, such as the root 
mean square deviation (RMSD), the mean error of the energies (ER), the 
standard deviation of the error (SDER) or the mean absolute error (AER) , 
overestimate the distance between potentials. Moreover, their precise sta- 
tistical meaning is far from clear. In this article, a new measure of the 
distance between potential energy functions is defined which overcomes 
the aforementioned difficulties. In addition, its precise physical meaning 
is discussed, the important issue of its additivity is investigated and some 
possible applications are proposed. Finally, two of these applications are 
illustrated with practical examples: the study of the van der Waals en- 
ergy, as implemented in CHARMM, in the Trp-Cage protein (PDB code 
1L2Y) and the comparison of different levels of the theory in the ab initio 
study of the Ramachandran map of the model peptide HCO-L-Ala-NH2. 

PACS: 87.14.Ee, 87.15.-v, 87.15.Aa, 87.15. Cc, 89.75.-k 



1 Introduction 

The most fundamental way to account for the behavior of a physical system is 
through its energy function H(q,p), which depends on the coordinates q and 
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the momenta p of all the particles. In normal situations, this function can be 
expressed as the sum of the kinetic energy K^p) 1 and the potential energy 
V (q). Because the former is of general form for any type of system and, normally, 
it does not affect the equilibrium properties, the latter is enough for a complete 
characterization of the problem. 

In the study and simulation of complex systems, such as proteins, researchers 
often face the dilemma of choosing among many different ways of calculating a 
conceptually unique potential energy V [1-7]. Others tackle the problem of de- 
signing new algorithms to perform this calculation looking for the improvement 
of the relation between accuracy and numerical complexity [8-19]. 

The energy V studied may be the total potential energy of the system or 
any of the terms in which it is traditionally factorized 2 and the different ways of 
calculating it may stem from distinct origins, namely, that different algorithms 
or approximations A are used, that the potential energy function depends on a 
number of free parameters P or that it is computed on different but somehow 
related systems S (for a proper definition of this, see Sec.[BJ|. 

Changes in these inputs produce different instances of the same physical 
potential energy, which we denote by subscripting V. For example, if it is 
calculated on the same system S, the algorithm and approximations A are held 
constant but two different set of parameters Pi and P2 are used, our notation 
made explicit would read as in the following equations 3 : 

VM := V(A, P u q, S) and V 2 (q) := V(A, P 2 ,q, S) . (1) 

For each practical application of these two potential energy functions, there 
is a limit on how different V\ could be from V2 to preserve the relevant features 
of the system under scrutiny. Clearly, if V\ is too distant from V2, the key 
characteristics of the system behavior will be lost when going from one function 
to the other. 

In the literature, a number of different methods are used to quantify this dis- 
tance. Among them, the Pearson's correlation coefficient r does not have units 
and its meaning is only semi-quantitative. Some others, such as the root mean 
square deviation (RMSD), the mean error of the energies (ER), the standard 
deviation of the error (SDER) or the mean absolute error (AER), do have units 
of energy and their values can be compared to the physically relevant scale in 
each problem. However, they tend to overestimate the sought distance even in 
the interesting situations in which the potential energy functions under study 
are physically proximate. The aim of this article is to define, justify and describe 
a new meaningful measure d(Vi,V2) of the distance between two instances of 
the same potential energy that overcomes the aforementioned difficulties, and 

The kinetic energy K depends, in general, on the positions and the momenta. However, 
if Cartesian coordinates are used, the dependence on positions vanishes. 

2 For example, in the case of proteins [20,21], some of the terms in which the total potential 
energy is typically factorized are the hydrogen-bonds energy, the van der Waals interaction, 
the excluded volume repulsion, the Coulomb energy and the solvation energy. 

3 Analogous definitions may be made if different algorithms or approximations, Ai and A2, 
are used or if V is computed on two related systems, Si and S2- 



2 



that allows to make precise statistical statements about the way in which the 
energy differences change when going from V\ to V2 4 . 

In Sec. |21 the hypothesis made on V\ and V2 to accomplish this arc outlined 
and, in Sec. 13 the central object, d(V\ : V2), is defined. The statistical meaning 
of the distance herein introduced is discussed in Sec. 01 and [5] and some of its pos- 
sible applications to practical situations are proposed in Sec. |§1 A comparison to 
other commonly used criteria is made and illustrated with a numerical example 
in Sec. [7| The important issue of the additivity of d(Vi, V2), when the potentials 
studied are only a part of the total energy, is investigated in Sec. OH and, in Ap- 
pendix A, the metric properties of our distance are discussed. The robustness 
of the van der Waals potential energy (as implemented in CHARMM [23,24]) 
under a change in the free parameters and the ab initio Ramachandran plots of 
HCO-L-Ala-NH2 at different levels of the theory are studied in Sec.|5]as exam- 
ples of applications of the distance. Finally, SecUHlis devoted to the conclusions 
and to a useful summary of the steps that must be followed to use the distance 
in a practical case. 

2 Hypothesis 

In some cases traditionally studied in physics, the dependence of the poten- 
tial energy V on the parameters is simple enough to allow a closed functional 
dependence V^Vi) to be found 5 . However, in the study of complex systems, 
such as proteins, this dependence is often much more complicated, due to the 
high dimensionality of the conformational space and to the fact that the energy 
landscape lacks any evident symmetry. The set C(Vi) of the conformations 
with a particular value of the potential energy V\ typically spans large regions 
of the phase space containing structurally different conformations (see Fig. 
When the system is slightly modified, from S\ to S2, or an approximation is 
performed (or the algorithm is changed), from A\ to A2, or the free parameters 
are shifted, from Pi to P2, each conformation q in C(Vi) is affected in a different 
way and its potential energy is modified, from V\{q) to V2{q), in a manner that 
does not depend trivially on the particular region of the phase space which the 
conformation q belongs to. In such simple functional relation V^Vi) 

is no longer possible to be found: for each value of V\, there corresponds now a 
whole distribution of values of V2 associated with conformations which share the 
same value of V\ but which are far apart in the conformational space. Moreover, 
the projection of this high-dimensional cf-spacc into the 1-dimcnsional V^-space 
makes it possible to treat V2 as a random variable parametrically dependent on 
Vi (see Fig. |2J, in the already suggested sense that, if one chooses at random 

4 The convenience of this approach has been remarked in [22]. Note, however, that in this 
article a different distance is defined. 

5 For example, if the recovering force constant of a harmonic oscillator is changed from k\ 
to &2, the potential energy functions satisfy the linear relation V2(q) = (ki/k2)Vi(q) for all 
the conformations of the system; if the the atomic charges are rescaled by a factor a (being 
actually aQi) and a is changed from 01 to a.2, the free energies of solvation calculated via 
the Poisson equation satisfy the linear relation V2{q) = (ati/at2) 2 Vi(q), etc. 
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a particular conformation qi £ C{V\), the outcome of the quantity V^cfi) is 
basically unpredictable 6 . 




Figure 1: Space of constant potential energy V in a simple system with only two 
degrees of freedom: an alanine dipeptide in vacuo. Potential energy surface (PES) 
calculated ab initio at the B3LYP/6-311H — hG(d,p) level of the theory in (Perczel, A. 
et al. 2003, J. Comp. Chern. 24:1026-1042). 




Figure 2: Illustration of the conditional-probability density function -P2|i(V2|Vi) at 
different Vi-positions. The points represent single conformations of the system. The 
Vi -conditioned mean of V2, (T^)(Vi), is depicted as a solid line and, although in the 
hypothesis it is assumed to depend linearly on Vi, here it is shown as a more general 
function to better illustrate the concepts involved. Analogously, the Vi-conditioned 
standard deviation of V2, 0-12 (Vi) (which is assumed to be constant in the hypothesis) 
is added to (V^HVi) (and subtracted from it) and the result is depicted as broken lines. 

In this context, the hypothesis to be done about the two instances of V are, 
first, that the pair of values (V\ (<fi), V^gi)) is independent of (Vi(qj), V^g})) if 

6 The same may be said in the case that the conformations belong to C(V2) and the random 
variable is, in turn, V\. The role of the two instances of V is interchangeable in the whole 
following reasoning, however, for the sake of clarity, this fact will be made explicit in some 
cases and will be tacitly assumed in others. 
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i 7^ j and, second, that the probability distribution of V2 conditioned by V\ is 
normal with mean 612V1 + ai2 and standard deviation a\2 and that, conversely, 
the probability distribution of Vi conditioned by V% (i.e., the distribution of 
the random variable V\ in the space C(V2), analogous to C(V\)) is normal with 
mean 621 V% + 0,21 and standard deviation 021. Where 012, 612, 012 j 021, 621 and 
1T21 are constants not dependent on V\ or V2. This can be summarized in the 
following expressions for the conditional-probability density functions: 



P 2 |i(W) = -=L — exp 

V27T(7l2 

Pi|2(Vi|V a ) = -=^exp 

V 27T(721 



(V2 - {b l2 Vx + a 12 )) 5 



(^i-(&2iV2+a 2 i)) 2 



2^1 



(2a) 
(2b) 



In general, one may reason about the whole conformational space of the sys- 
tem C and regard each randomly selected conformation <fj as a single numerical 
experiment to which the value of two random variables, Vi(gi) and V^cfi), can 
be assigned. However, no hypothesis need to be made about the joint proba- 
bility density function Pi2(^i,^2) 7 - For the distance herein introduced to be 
meaningful, it suffices to assume Eq. [21 

Regarding the question of whether in a typical case these hypothesis are ful- 
filled or not, some remarks should be made. First, the satisfaction of the inde- 
pendence hypothesis depends mainly on the process through which the working 
set of conformations {qi]f =1 is generated. For example, if the conformations 
are extracted from a single molecular dynamics trajectory letting only a short 
simulation time pass between any pair of them, their energies will be obviously 
correlated and the independence will be broken. If, on the contrary, each con- 
formation q\ is taken from a different trajectory (see the first example in Sec.|5J), 
one may reasonably expect this assumption to be fulfilled, i.e., the independence 
hypothesis is normally under researcher's control. 

The normality hypothesis, however, is of a different nature. That the dis- 
tribution of V2 be normal for a particular value of V\ may be thought as a 
consequence of the large number of degrees of freedom the system possesses, 
of the usual pairwise additivity of the forces involved and of the Central Limit 
Theorem (this, in fact, can be proved in some simple cases). Nevertheless, 
that the V\ -conditioned mean of V2, (V2)(Vi), is linear in V\ and that the V\- 
conditioned standard deviation of V2, cr^iVi), is a constant must be regarded 
as a zeroth order approximation that should be checked in each particular case 
(see Fig. 0). It is worth pointing out, however, that, for the commonly used 
statistical quantities r, RMSD, etc. to be useful, this assumption must also be 
made (see Sec. 0) and also that it has been found to be approximately fulfilled 
in several cases studied (see, for example, [22] and Fig.|2>). 

7 The hypothesis that Pi2(Vi,V2) is bivariate normal, for example, is stronger than the 
assumptions in Eq. |5] The latter can be derived from the former. 



5 



3 Definition 



For the aforementioned cases in which the dependence of the potential energy 
on the parameters is simple enough (see footnote |SJ , one can describe V^Vi) 
by a closed analytical formula and exactly compute a%2, 612 and a\2 (this last 
quantity being equal to zero in such a situation). However, in a general situation, 
the parameters entering Eq.|21can not be calculated analytically. In such a case, 
one may at most have a finite collection of N conformations {qi\f = i and the 
respective values V\(qi) and V 2 (qi) for each one of them. 

From this finite knowledge about the system, one may statistically estimate 
the values of a\2, &12 and a\2- Under the hypothesis assumed in the previous 
section, the least-squares estimators [25,26] of these quantities are optimal in the 
precise statistical sense that they are maximum-likelihood and have minimum 
variance in the class of linear and unbiased estimators 8 [27] . 

If we denote Vj 8 := V\{qi) and V 2 l := ^(gi), and N is the number of confor- 
mations in the working set, the mean-squares maximum-likelihood estimators 9 
of eti2, 612 and CT12 are given by the following expressions [25,26]: 



Cov(Vi,V 2 ) 



a\2 = M2 - 



CT12 



££i(*?-(W + ai2)) ; 
N 



1 1/2 



(3a) 
(3b) 

(3c) 



where: 



M2 



N ^ 1 ' 

i=l 
1 N 

i=l 

i=l 
1 W 

CovCVi, U 2 ) := ^ - - Ma) 



n 1/2 



(4a) 
(4b) 

(4c) 
(4d) 



8 The same letters are used for the ideal parameters 012, 612, and (T12 and for their least- 
squares best estimators, because the only knowledge that one may have about the former 
comes from the calculation of the latter. 

9 In this article, the maximum-likelihood estimator for 012 (with N in the denominator) is 
preferred to the unbiased one (with TV — 2 in the denominator) for consistency. Anyway, for 
the values of N typically used, the difference between them is negligible. 
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The quantities with 21 subscripts are found by changing 1 <-» 2 in the pre- 
ceding expressions and the central object of this article, the distance d(V\, V 2 ) 
between two different instances of the same potential energy V is defined as 
follows: 

d(Vi,V 2 ) <%if * ■ (5) 

It must be stressed here that the measured distance depends on the working 
set, {q}}^L 1 , of conformations chosen. More precisely, it depends on the occur- 
rence probability of an arbitrary conformation q in the set. This probability 
must be decided a priori from considerations regarding which regions of the 
phase space arc more relevant to answer the questions posed and up to what 
extent. For example, if one believes the system under study to be in thermody- 
namical equilibrium, then, it would be reasonable to generate a working set in 
which the probability that q occurs is proportional to its Boltzmann weight. If, 
on the contrary, one doubts whether the system is crgodic or not (as in the case 
of proteins) or one simply wants to study in detail the dynamical trajectories 
out of equilibrium, then, all the conformations in the phase space should be 
weighted equally and the probability should be flat. 



4 Meaning 

Under the hypothesis made in Sec. [3 (independence and Eq. 0), a simple ex- 
pression may be written for the probability density function of the VVenergy 
differences AV 2 conditioned by the knowledge of the Vi-energy differences AVj.: 



P A2lA1 (AV 2 \AV 1 ) = -=L— exp 
V 27rdi2 



(AV2 - b 12 AVi) 2 



2d{ 2 

The quantity d\ 2 in this equation is defined as follows: 



(6) 



d 12 := V2a 12 . (7) 
It is related to the distance defined in Eq.[S]via the following expression: 

/j2 ,J2 \i/2 

d ( Vl ,v 2 ) = ( ( hi±hi\ . (8 ) 



And it encodes the loss of information involved in the transit from V\ to V 2 
through the following important properties: 

1. The addition of an energy reference shift 012 between V\ and V 2 has neither 
an implication in the physical behavior of the system nor in the numerical 
value of d\ 2 - 

2. One of the novel features of the distance herein defined is that no loss 
of information is considered to occur (i.e., d\ 2 = 0) if there is only a 
constant rescaling 612 between the two potential energy functions studied. 
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Figure 3: (a) Vi- and V2-energies of the set of 1100 conformations of the Trp-Cage pro- 
tein used in the first example of Sec. [5] Both potentials are the van der Waals energy as 
implemented in CHARMM; Vi corresponds to R c = 1.275 A and V 2 to R c = 3.275 A, 
ec is fixed to —0.020 kcal/mol. The values of AVi and AV2 for a selected pair of con- 
formations are also depicted. The solid line represents the least-squares fit and the 
region where the probability of finding a conformation is largest is enclosed by broken 
lines, (b) Histogram of the residues ei2(<fi) := Vziqi) — [6i2Vi(ifi) + ai 2 ] associated to 
the mean-squares fit in Fig. a. 

Although such a transformation does have physical implications and would 
change the transition rates in a molecular dynamics simulation and alter 
the effective temperature in any typical Monte Carlo algorithm, Vi can be 
easily recovered from V2, if pertinent, upon division of V% by 612. If the 
two potential energy functions are on equal footing (e.g., they correspond 
to different values of the free parameters (see Sec. 01), there is no correct 
energy scale defined. However, in the case that the distance is used to 
compare an approximated potential to a more ab initio one or even to 
experimental data, the correct energy scale must be considered to be that 
of the more reliable potential and the rescaling 612 may be safely removed 
as indicated above. Note that the quantity d\2 changes, when this removal 
is performed, from d\2 to c£i2 / 1&12 1 (to see this, take the analogous for 02 
of Ea. Iceland change V$ by V^/b^, finally, take the result to Ea. ll2a(l and 
it is the second value which must be considered as the relevant one. 





3. Directly from its very definition in Eq. [7J one has that d\2 = is equiv- 
alent to V2 being exactly a linear transformation of V\, i.e., to V2(<fi) = 



4. As stated above, 612 ^ 1 and/or 012 7^ must be regarded as two different 
types of systematic error easily removable and not involving any loss of 
information when one changes V\ by V 2 . In the general case, however, the 
energy differences associated to each potential energy function (which are 
the relevant physical quantities that govern the system behavior) present 
an additional random error which is intrinsic to the discrepancies between 
the potentials and can not be removed. As can be seen in Eq. [S| in this 
situation, d\ 2 is the standard deviation of the random variable AV 2 and, as 
its value decreases, the distribution becomes sharper around the average 
612 AV\. Moreover, because the distribution is normal, the probability of 
AV 2 being in the interval (b V2 AVi - Kdi2,b 12 AVi + Kd X2 ) is ~ 38% for 
K = 1/2, - 68% for K = 1, ~ 95% for K = 2, etc. Hence, d V2 quantifies 
the random error between the trivially transformed potential b\ 2 Vi + a\ 2 
and V2, i.e., the unavoidable and fundamentally statistical part of the 
difference between V\ and V 2 which stems from the complex character of 
the system. 

5. To gain some insight about the meaning of d\ 2 , the following gedanken 
experiment may be performed: if a Gaussian noise with zero mean and 
variance equal to s 2 were independently added to the linearly transformed 
l^i-energy, b\ 2 Vi (q) + a\ 2 , of each conformation and the resulting potential 
were denoted by V 2 , then one would have that the hypothesis in Eq. I2al 
is fulfilled and that d\ 2 = \f2s. Therefore, d\ 2 may be regarded (except 
for a harmless factor V2) as the size of the Gaussian noise arising in the 
whole energy landscape when one changes b\ 2 V\ + a\ 2 by V 2 . 

6. Closely related to the properties in the two preceding points, an illumi- 
nating statistical statement about the energetic ordering of the conforma- 
tions can be derived from Eq. The probability that the energetic order 
of two randomly selected conformations is maintained when going from 
V\ to V 2 (more precisely, that sign{AV 2 ) = sign(bi 2 AV\)) conditioned by 
the knowledge of AVi, can be easily shown to be: 



The intuitive meaning of this expression is that d\ 2 is the V\ -energy dif- 
ference at which two randomly selected conformations can be typically 
resolved using V 2 after the removal of the harmless rescaling 612 (see 
Pt.gJ). Certainly, if one has that |&i 2 AVi| <C di2, then P ord = 1/2, reflect- 
ing a total lack of knowledge about the sign of AV 2 and, consequently, 
V 2 can not be used to resolve Vi-encrgy differences. If, on the contrary, 
I&12AV1I 3> d\2, then P OI d = 1 and the conformations ordering is exactly 



bi 2 V 1 {q i ) + a 12 , Vffi € {%} 



N 

i=l- 




(9) 
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conserved. In any intermediate point, P orc i is a rapidly increasing function 
of |6i2 A.Vi |/ c?i2 that reaches a reasonable value (~ 84%) when its argu- 
ment equals 1, i.e., when I&12AV1I = di 2 - Some other interesting points 
arc P ord (l/2) ~ 69% or P ord (2) ~ 98%. 

7. Finally, some clarifying properties of the distance associated to its relation 
to the Pearson's correlation coefficient will be investigated in Sec. [7| (see 
specially Ea. ll2afl . 

The same considerations may be done about d 2 \ regarding the transit from 
V 2 to V\ and, as can be seen in Eq. 03 the square of d(Vi, V 2 ) is the mean of 
the squares of d 12 and c? 2 i- Therefore, this measure of the difference between 
potential energy functions quantifies the average size of the uncertainty in the 
energy differences of the system that arises from changing one of the potentials 
studied by the other. If the comparison is performed between potential energy 
functions that stand on the same footing (see, for example, the second possible 
application in Sec. EJ, the symmetric quantity d(Vi,V^) should be used as a 
summarizing measure of the loss of information involved in the transit from V\ 
to V2 and vice versa. However, if one of the potentials is a priori considered to be 
more ab initio or more accurate and it is compared to a less reliable instance, V\ 
may denote the former, V 2 the latter and one may use only d 12 as the measure 
of the discrepancies between them 10 . 

Hence, although both the discussion regarding the relevant values of the 
distance in the following section and the investigation of its mathematical prop- 
erties in Sec. |H1 and Appendix A are referred to d(Vi,V 2 ) for generality, they 
may be equally applied to d 12 ■ Conversely, the comparison between d\ 2 and the 
quantities commonly used in the literature done in Sec. may be extended to 
d(Vi, V2) upon symmetrization of ER, which is the only asymmetrical one. 

5 Relevant values of the distance 

Regarding the value of d{V\,V 2 ) in a practical case, some remarks must be 
made. One may expect two special values of the distance to exist: d mnl and 
rfmax- In such a way that, if d{V\, V2) < d mnl , one potential energy function may 
be substituted by the other without altering the key characteristics of the system 
behavior, and that, if d{V\, V 2 ) > d max , then, the substitution is not acceptable. 
This limiting values must be set depending on the particularities of the system 
studied and on the questions sought to be answered, and it may even be the case 
that some special features of the energy landscape are the main responsible of the 
behavior under scrutiny. For example, we are not going to establish any strict 
limit on the accuracy required for a potential energy function to successfully 
predict the folding of proteins [20,21]. We consider this question a difficult 
theoretical issue, whose solution probably requires a much deeper knowledge 
of the protein folding problem itself than the one that exists at present, and 

10 Note, from Eq.0 that, if d 12 = d 2 i, then d(Vi, V2) = d\i = d 2 \. 
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we believe that it may be possible a priori that some special features of the 
energy landscapes of proteins (such as a funnel-like shape [28,29]) are the main 
responsible of the high efficiency and cooperativity of the folding process [20,21]. 
If this were the different procedure for measuring the distance between 

potential energy functions could be devised for this situation [30-32], as any 
change of V\ by V2 which did not significantly alter these special features would 
be valid even if the value of d(Vi, V2) were very large. Our definition of d(Vi, V2), 
being based in characteristics shared by many complex systems and statistically 
referred to the whole energy landscape, is of more general application but cannot 
detect such particular features as the ones mentioned. 

However, due to the laws of Statistical Mechanics, a rather stringent but 
general value for d min can be used to a priori assess the interchangeability 
of V\ and V2. As can be seen in the thermodynamical equilibrium Boltz- 
mann distribution, in which the probability pi of a conformation q\ is propor- 
tional to exp(—V(qi)/RT), the order of the physical uncertainty in the po- 
tential energies of a system in contact with a thermal reservoir at temper- 
ature T is given by the quantity RT 11 . This typical energy sets the scale 
of the thermal fluctuations and it also determines the transition probability, 
min[l, exp(— (V(q i+ i) — V(qi))/RT)], in the Metropolis Monte Carlo scheme 
and the spread of the stochastic term in the Langevin equation [33,34]. Conse- 
quently, in this article, RT (which equals ~ 0.6 kcal/mol at room temperature) 
will be used as a general lower bound for d mm . The results will be presented in 
units of RT and any two instances V\ and V2 of the same potential energy func- 
tion whose distance d(Vi, V2) be smaller than RT will be regarded as physically 
equivalent 12 . 

Regarding d m ax, no estimations of its value can a priori be made without 
referring to the particular potential energy functions compared and the relevant 
behavior studied. The fact that Ea. ll2bl has an absolute maximum when t\2 = 
sets only the worst possible upper bound and is only of mathematical interest. 

6 Possible applications 

There are at least three basic situations in which the distance defined in this 
article may be used to quantify the discrepancies between two different instances, 
V\ and V2, of the same potential energy: 

• If the difference between V\ and V2 arise from the use of two distinct 
algorithms or approximations, d(Vi, V2) (or d\2, see the final lines of Sec.0J 
may help us to decide whether the less numerically complex instance could 
be used or not. For example, one may compare the electrostatic part of 
the solvation energy calculated via solving the Poisson equation [36-39] 

11 RT is preferred to fcgT because per mole energy units are used in this article. 

12 This discussion is closely related to the common use of the concept of chemical accuracy, 
typically defined in the field of ab initio quantum chemistry as predicting bond-breaking 
energies to 1 kcal/mol [35]. 
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with the instance of the same energy calculated using one of the many 
implementations of the Generalized Born model [9,11-13,16,19,40,41], 
which are much less computationally demanding and more suitable for 
simulating macromolecules. If the distance between them is small enough 
for the behavior under study not to be much modified (see Sec. El, the 
latter could be used. The second example in Scc.^is devoted to illustrate 
this type of comparison, which, of the three possible applications proposed 
in this section, is the one most commonly found in the literature. 

• If the algorithm and the approximations are fixed and only one system 
S is studied, any reasonable functional form used to account for V will 
be a simplified model of physical reality and it will contain a number of 
free parameters P. These parameters, which, in most of the cases, are not 
physically observable, must be fit against experimental or more ab initio 
results before using the function for practical purposes. For any fit to yield 
statistically significant values of the parameters, the particular region of 
the parameter space in which the final result lies must have the property of 
robustness, i.e., it must occur that, if the found set of parameters values is 
slightly changed, then, the relevant characteristics of the potential energy 
function which depends on them are also approximately kept unmodified. 
If this were not the case, a new fit, performed using a different set of exper- 
imental (or more ab initio) points, could produce a very distant potential. 
If Vi and V<i come from the same family of potential energy functions and 
they correspond to different values of the free parameters P, the distance 
cf(Vi, Va) between them may help us to assess the robustness of the po- 
tentials under changes on the parameters. In the first example of Sec. El 
the robustness of the van der Waals potential energy implemented in the 
well-known molecular dynamics program CHARMM [23,24] is quantified 
as an example of this. 

• The last application of the distance is fundamentally different of the ones 
in the previous points but, although the reasoning throughout the article 
is intentionally biased, for the sake of clarity, toward the study of poten- 
tial energies of the same system, one may appeal to the same underlying 
assumptions to compare two different systems, Si and S2, provided that 
a meaningful mapping can be established between both conformational 
spaces 13 . For example, if the conformations of a particular protein are 
described only by their backbone angles, one can define an unambiguous 
mapping between the conformations of, say, the wild-type chain and any 
mutated form, in such a way that V\(q) would represent the energies of 
the former and V2(q) those of the latter. The distance d(Vi,V2), in this 
case, quantifies how different the energy landscapes of the two systems are 
and, depending on the features under study, how sensitive the behavior of 

13 In short, for the distance criterion to be applied, one needs to be able to assign two 
energies, Vi(q) and V2{q), to each conformation q. This is done trivially in the first two points 
but it requires a mapping between the conformational spaces of Si and 52 in the third case. 
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the protein is to mutations. The comparison of a potential, Vi, to another 
one, V2, which comes from the first via integrating- out certain degrees 
of freedom and which is commonly termed effective potential energy [42], 
may be considered to be another example of this type of application. 

7 Relation to other statistical quantities 

In the literature, some comparisons between potentials 1 are performed a poste- 
riori, i.e., not directly studying the energies but computing some derived quan- 
tities, such as the pK a of titratable groups [9, 12], investigating molecular dy- 
namics trajectories [3,6,8, 13], comparing the ability of the different instances 
of V to select the correct native state of a protein from a set of decoys [8], etc. 

For the a priori comparison of two ways of calculating the same potential 
energy, one may investigate the whole energy landscape visually if the system 
has no more than two degrees of freedom [14,17], but, if the object of study is a 
protein or another complex system, the vastness of the conformational space and 
its lack of symmetries require the utilization of statistical quantities calculated 
from the energies of a finite set of conformations. Among the most common such 
measures, one may find the root mean square deviation (RMSD) [1,6,13,14,16], 
the mean error of the energies (ER) [1,7,10], the standard deviation of the error 
(SDER) [7], the mean of the absolute error (AER) [11], all of which have units 
of energy, and the Pearson's correlation coefficient r [2, 4, 6, 7, 10, 15, 19], which 
does not have units. Finally, in [16], a root mean square of the difference in the 
relative energies (REL) (see Eg. II Pel for a clarification) which is proximate to 
di2 is defined, however, it has not been found to be used in any other work. 

If we use the same notation as in Eq. |3|and we define AT^ := V£ — V{, the 
statistical quantities mentioned in the preceding paragraph (except r, which will 
be discussed later) are given by the expressions: 

14 It must be pointed out that we have only found in the literature examples of the com- 
parison between two potentials corresponding to the first case described in Sec. |S] which is 
associated to different algorithms or approximations. No examples of robustness studies have 
been found and, regarding the third case, in which the differences arise from a slight change 
in the system, such as a mutation in a protein, only articles investigating the total free energy 
of folding have been found [43,44]. 
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In the following points, this measures of the difference between potential 
energy functions are individually compared to the distance defined in Sec. [21 
and their limitations with respect to di 2 are pointed out 15 : 

• The first one, the RMSD, which is one of the most commonly used, 
presents the major flaw of overestimating the importance of an energy 
reference shift between V± and V 2 . This transformation, which has no 
physical implications in the conformational behavior of the system, must 
not influence the assessment of the difference between potentials. This fact 
is, for example, detected in some of the comparisons performed in [6] and 
recognized to be conceptually erroneous in [1] , where the shift is removed 
by minimizing the RMSD with respect to it. In addition, the RMSD also 
overestimates the effect of a slope 612 7^ 1 between the two potentials, a 
fact that, as has been remarked in Sec. 01 is not desirable (for a practical 
case in which the loss of information is small but 612 ^ 1 see [10]; for a nu- 
merical example see Fig.QJand the discussion at the end of this section). It 
can be proved that, if bi 2 = 1 and a\ 2 = 0, then RMSD(Vi, V 2 ) = d\ 2 /^/2. 

• ER, in turn, only accounts for a systematic error between the two poten- 
tials (an offset). The relation ER(Vi, V 2 ) = [i 2 — [i\ holds and ER equals 
the energy-reference shift a\ 2 if b\ 2 = 1 (see Eg. I3b|l . Thus, the changes 
in the conformational behavior of the system are not reflected by this 
quantity. 

• In SDER, the standard deviation associated to ER, the reference shift 
is removed by subtracting ER from each difference AVi 2 . However, this 



15 The quantity ER is not symmetrical. This is why all the measures in Ea. llUl arc compared 
to di2 and not to its symmetrized version d(Vi,V2) (see Appendix A and the final part of 
Sec. |3 . 
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quantity still overestimates the effect of a slope 612 7^ 1, in fact, only if 
612 = 1, one has that SDER(Vi, V 2 ) = d 12 /V2. 

• To establish precise relations between AER and di 2 is difficult because of 
the modulus function that enters this quantity. Nevertheless, it is clear 
from its definition that AER, like ER, overestimates both the effect of an 
energy reference shift a\ 2 and of a slope b\ 2 7^ 1 . For a numerical check of 
this fact, see Tab. Q] 

• Finally, the measure REL, introduced in [16], has much of the spirit of the 
distance defined in this work. On one hand, it focuses on the energy differ- 
ences, which arc indeed the relevant physical quantities to study the con- 
formational behavior of the system, on the other hand, it correctly removes 
the effect of an energy reference shift a\ 2 . However, it still overestimates 
the importance of a slope 612 7^ 1 and one only has that REL(Vi , V 2 ) — d\ 2 
if 612 = 1. 

There is yet another quantity commonly used for measuring the differences 
between two potentials: the Pearson's correlation coefficient (denoted by ri2 in 
the following): 

U\(J 2 

This statistical measure differs from the ones discussed above in several 
points. On one hand, it has no units; a fact that renders difficult to extract 
from its value relevant statements about the energies studied. Some statistical 
statements can be made about the real value of ri2 (the value in an infinite 
sample, denoted by P12), however, to do this, the sampling distribution of ri2 
must be known. Without making stringent assumptions about the joint distri- 
bution Pi 2 (Vi,V 2 ) (see Sec. [^J only the null hypothesis of p 12 being equal to 
can be rejected from the knowledge of r 12 in a finite sample [45]. This is clearly 
insufficient, because, in the vast majority of the cases, the researcher knows that 
the two potentials arc correlated, i.e., the null hypothesis can be easily rejected 
from a priori considerations. If, in turn, one assumes Pi 2 {Vi, V 2 ) to be bivariate 
Gaussian, the Fisher transformation can be used to make inferences about p\ 2 
which are more general than p\ 2 7^ [45]. In any case, unfortunately, these type 
of statements are not directly translated into statements regarding the energies; 
a fact that undermines much of the physical meaning in ri2- 

On the other hand and despite the disadvantages remarked in the preceding 
lines, r\ 2 behaves satisfactorily when an energy reference a% 2 is added or when 
a rescaling b\ 2 7^ 1 is introduced between V± and V 2 ; like d\ 2 , and in contrast to 
RMSD, ER, SDER, AER and REL, the Pearson's correlation coefficient does 
not overestimate such transformations, in fact, r\ 2 is completely insensitive to 
them. Therefore, it is not surprising that a simple general relation can be written 
between both r\ 2 and d\ 2 : 
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d 12 =V2<r 2 (l-r? 2 ) 1 / 2 , (12a) 
d(V u V 2 ) = [(at+a 2 2 )(l-r 2 12 )} 1 / 2 . (12b) 

In these expressions, it can be observed that the distance herein introduced 
depends on two factors: on one side, the width of the probability distributions 
associated to the potentials <J\ and a 2 , which set the physical scale and give 
energy units to d(Vi,V2), on the other, the quantity 1 — rj 2 , which measures 
the degree of correlation between V\ and V 2 . The second factor is completely 
insensitive to a change in the energy reference shift or in the slope (due to the 
properties of ryi); the part that depends on the width of the distributions, in 
turn, makes the distance sensitive to a change in the slope (remaining insensitive 
to a change in the reference), through a 2 if the rescaling is performed on V 2 
(<j 2 ^ C2/I612 1)- However, contrarily to the case of the quantities in Eq. I1UI 
the implications of such a transformation are not overestimated. In the case of 
our distance, the sensitivity to a rescaling arises only from the dilatation of the 
random errors, whereas the other quantities take erroneously into account the 
fact that the best fit line is not necessarily parallel to the line V 2 = V\ (see the 
following numerical example). 

In short, the distance defined in this article consists in a physically mean- 
ingful way of giving energy units to the Pearson's correlation coefficient. 



a) 

/ 


b) y> 

// 




d) 







Figure 4: Numerical examples of the possible situations found in practical problems. 
200 conformations are depicted for each case with the values of V\ in the a>axis and the 
ones of V2 in the y-axis (both in arbitrary energy units). The broken line corresponds 
to the line V2 = Vi. (a) 612 =± 1 and 012 ~ 0. (b) &12 ~ 1 and 012 ~ 200. (c) 
612 ~ 1/2 and 012 ~ 0. (d) 612 ~ 1/2 and 012 ~ 200. 

To close this section, a numerical example is presented that summarizes the 
situations that may be found in practical examples (see [10] for a real case 
of the issues raised) and that makes explicit the aforementioned disadvantages 
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RMSD ER SDER AER REL r 12 d 12 



bi2 ^ 1 


a 12 ~ 


9.6 


-0.7 


9.6 


7.7 


13.6 


0.995 


13.5 


612 c± 1 


ai 2 ~ 200 


199.8 


199.6 


9.4 


199.6 


13.3 


0.996 


13.2 


612 ^ 1/2 


ai2 ~ 


52.7 


-8.4 


52.0 


41.9 


73.8 


0.980 


14.4 


612 ^ 1/2 


ai2 ~ 200 


205.0 


198.0 


52.9 


198.0 


74.9 


0.985 


13.2 


Overestimates a\2 7^ 


Yes 


Yes 


No 


Yes 


No 


No 


No 


Overestimates 612 7^ 1 


Yes 


Yes 


Yes 


Yes 


Yes 


No 


No 


Has units of 


' energy 


Yes 


Yes 


Yes 


Yes 


Yes 


No 


Yes 



Table 1: Values of the statistical quantities RMSD, ER, SDER, AER, REL, n 2 and 
di2 computed in the situations depicted in Fig. |1] All the values are in arbitrary 
energy units except the ones of m which have no units. A summary of the properties 
of each quantity is presented in the bottom part of the table. 



of the commonly used statistical quantities. In Fig. four samples of 200 
conformations are depicted with the values of V\ in the x-axis and the ones 
of V2 in the y-axis (both in arbitrary energy units). The different situations 
correspond to all generic cases in which 012 = or a\2 7^ and in which 612 = 1 
or &12 7^ 1. All the quantities discussed in this section, including <ii2, have been 
computed in each case and their values are presented in Tab. ^ 

From these data and the preceding discussion, some conclusions may be 
extracted. First, among the quantities with energy units, SDER and REL are 
the most proximate to the distance c?i2, although they will overestimate the 
difference between potentials in situations in which there is a constant rescaling 
612 7^ 1 between them. In Fig. for example, the contribution of the points 
that lie further apart from the origin of coordinates is overestimated by all the 
quantities in Eq. for the sole fact that the best fit-line and the line V2 = V\ 
arc not parallel (note that a\2 = and that the random noise associated to 
these points is not particularly large compared to the one that corresponds to 
the points in the central region of the figure). This is due to the fact that all 
quantities in Eq. ^| are based on AV^ which measures the distance of each 
point to the line V2 — V\. A disadvantage that is not shared by c?i2, which 
measures the differences with respect to the best-fit line. 

Second, the Pearson's correlation coefficient ri2 has good properties, al- 
though no physically relevant statements can be extracted from its value due, 
among other reasons, to the fact that it does not have units. In Tab. ^ for 
example, the value of 7~i2 is close enough to 1 to be considered as a sound sign 
of correlation, however, the value of di2 (if we pretend it to be in kcal/mol, 
which could be the case) tells us that the typical indetermination in the energy 
differences, when substituting V\ by V2, is around 13 kcal/mol, a value an order 
of magnitude larger than RT. As explained in Sec. 03 this suggests that the 
relevant behavior of the system may be essentially modified. 

Finally, it is worth stressing that all the considerations made in this section 
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and throughout the article are valid when the physical quantities compared are 
potential energy functions of the same system or closely related systems (see 
Sec. EJ. When other quantities, such as the pK a , charges, dipoles, Born radii, 
etc. or energies of distinct systems are the object of the comparison, the as- 
sessment of the discrepancies rests on different theoretical basis and, frequently, 
only semi-quantitative statements can be made. Acknowledging this limitation, 
the use of any of the quantities studied in this section, including d\2 , may be 
fully justified. Note, in addition, that the numerical effort needed for the cal- 
culation of d\2 is both low and very similar to the one required to compute any 
of the other quantities (see Sec. llUfl . 



8 Additivity 

Frequently, the potentials compared are instances of only a part of the total 
potential energy of the system. If the conclusions extracted, via d(Vx,V2), in 
such a case are pretended to be meaningfully transferred to the total energy, 
this measure of the difference between potentials must obey some reasonable 
additivity rules. Here, we will see that, for some relevant cases in which certain 
independence hypothesis are fulfilled, our distance is approximately additive, 
although, in other relevant situations, it is not. 

For the sake of brevity, the notation will be much relaxed in this section and 
we will assume that we are working with six different potentials, x, y, p, r, q 
and s, that satisfy the following relations: 



x = p + q and y = r + s . (13) 

Conceptually, x and y must be regarded as instances of the same potential 
energy and the same can be said about the pair p and r and the pair q and 
s. Hence, the study of the additivity of our distance rests on finding a way 
of expressing d(x, y) as a function of d(p, r) and d(q, s). If one assumes that p 
is independent from q and that r is independent from s (see the discussion at 
the end of this section for the implications of such an hypothesis), one has that 
r pq = 0, r rs — 0, a 2 = a 2 + a 2 and a 2 = a 2 + a 2 . In such a case, the following 
additivity relation can be written: 

d 2 (x,y) = d 2 (p,r) + d 2 (q, S ) + Ad, (14) 

where: 

Ad := (o 2 v + (T 2 r )(rl r - r 2 xy ) + (a 2 + cr 2 )(r 2 s - r 2 xy ) . (15) 

And the correlation coefficient r xy can be expressed in terms of quantities 
associated to p, r, q and s in the following way (note that r xy is indeed not 
additive) : 

(16) 



K + a 9 2 ) 1 /2( (7 2 + (T 2 ) l/2 
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Now, one can see in Eq. [21 that, if Ad were zero, the square of the distance 
would be exactly additive in the aforementioned sense, making it possible to 
assert, for example, that, if p is proximate to r and q is proximate to s, then 
x = p + q is proximate to y = r + s. Unfortunately, this is not the case. It can 
be shown that Ad > (the distance is over- additive) and, without imposing 
any restriction on the potentials studied, nothing satisfactory can be said. For 
example, a particularly undesirable, albeit also uncommon, situation is that for 
which Cov(p,r) = — Cov(g, s). Such a relation, makes zero the numerator in 
Eg. ITT)1 and, consequently, r xy . Substituting r xy = in Eo.lT51and taking Ad to 
Eg. 1141 one has that, for every allowed value of r pr and r qs : 



Cov(p, r) = -Cov( 9 , s) => d 2 (x, y)=a 2 +a 2 + a 2 + a 2 = a* + a 2 y , (17) 
which is the worst possible value of d(x, y). 

However, there exists a particular class of situations than can be argued to 
be proximate to the situations found in typical cases and for which the additivity 
is approximately accomplished. These special situations are characterized for 
the satisfaction of the following relation: 

Cp/ov = o- q /a s := k . (18) 
When this equality is satisfied, it can be proved that the following quotient: 

Ad rol := Ad/(d 2 (p, r) + d 2 (q, s)) , (19) 

which measures the relative deviation from the exact additivity, does not 
depend on k and can be expressed as a function of only ay, o~ s , r pr and r qs . If, 
in addition, we define c through a s = co>, without loss of generality, we can 
write Ad re i as a function of only r pri r qsi and c as follows: 



(l + c 2 )(l-r2 r + c 2 (l-r2J) 



A^ el = M , ^,V'1 2 '«'> _ 2 „ ■ (20) 



Representing this equation as a three-dimensional surface (see Fig. [SJj) , one 
has a valley whose lowest region lies in the line r pr = r qs and has zero height, 
i.e., Ad ro i(r pr = r qs ) = 0. The slopes of the valley are curved and ascend as one 
moves away from the minimum line, eventually reaching arbitrarily large values 
of Adrei when (r pr ,r qs ) — ► (1,-1) or (r pr ,r qs ) — > (—1,1). 

Numerically, the region for which the value of Ad re i is acceptable is rather 
large. In Fig. [5J>, the contour lines corresponding to Ad re i = 10% are depicted 
for some values of c that may be found in practical cases. It can be seen 
that, as one departs from c = 1, the region for which A<f rc i < 10% gets larger, 
occupying, in any case, the majority of the (r pr , r g;! )-space. Therefore, one 
can conclude that, for the cases in which Eq. ^| is satisfied, the square of 
the distance introduced in this article is approximately additive in the relevant 
situations in which the correlations between p and r and between q and s are 
similar. Moreover, for continuity arguments, one has that, in the case that 
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Figure 5: Graphical study of the additivity of the distance, (a) Ad ro i as a function 
of r vr and r qs for c = 1 (see Ea. I2()|l . Contour lines are plotted at the levels Ad rc i = 
10%, 50%, 100%, 150%. (b) In white, the regions in (r pr , r, s )-space with AgU < 10% 
for different values of c. From left to right and from top to bottom, each figure 
corresponds to c = 1/2, 1/5, 2, 5. In each case, the borders of the Ad rc i < 10% region 
for c = 1 are shown with broken lines for comparison. 



Ea. 1181 were only approximately satisfied, the situation would be proximate to 
the one described in the previous lines. 

Finally, some remarks must be made about the assumption of independence 
between p and q and between r and s. At first sight, one would say that this hy- 
pothesis, as the independence hypothesis in Sec. [21 is under researcher's control. 
In the case of a generic complex system (a spin glass, a random hctcropolymer, 
etc.), this is indeed the case, however, if the object of study is a protein, one 
must be cautious. It is widely believed that the sequences of proteins are the re- 
sult of a million-years-long selection process whose driving force is the search for 
the ability to fold rapidly and robustly [20,21,28,29]. Regarding the interactions 
responsible of the folding process, this means that they have been optimized in 
the sequence space to be minimally frustrated [46], i.e., maximally cooperative. 
In such a case, the correlations between different parts of the total potential 
energy may be large and the study of the additivity done in this section should 
be regarded as a privileged reference situation. 
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9 Application 




Native T f = 500 K If = 750 K T f = I 000 K 

R G = 7.0A <R C ) = 7.9A <R o ) = 10.8A <R G > = 13.3A 

RMSD = 0.0A <RMSD>=3.3A <RMSD>=7.lA <RMSD> = 9.7A 



Figure 6: Native conformation of the Trp-Cage protein together with arbitrarily 
chosen structures from three particular subsets of the working set. The average 
radius of gyration (Re) and the average RMSD with respect to the native struc- 
ture is presented for each set. Both quantities have been computed taking into ac- 
count only the a-carbons. Pictures generated with PyMOL (DeLano, W. L., 2002, 
http : //www.pymol . org |. 

To illustrate one of the possible practical applications of the distance, we 
first study the robustness of the van der Waals energy, as implemented in the 
CHARMM molecular dynamics program [23,24], in a particular system: the de 
novo designed protein known as Trp-Cage [47] (PDB code 1L2Y). 

The program CHARMM itself was used as a conformation generator. From 
the native conformation stored in the Protein Data Bank [48], a 10 ps heating 
dynamics 16 was performed on the system, from an initial temperature T = K 
to eleven different final temperatures (from Tt — 500 K to Tf = 1000 K in 
steps of 50 K). This was repeated 100 times for each final temperature with a 
different seed for the random numbers generator each time. The overall result of 
the process was the production of a working set of 1100 different conformations 
of the protein, whose structures range from close to native (the Tf = 500 K set) 
to completely unfolded (the Tf = 1000 K set) (see Fig.^J. It is worth remarking 
that the short time in which the system was heated (10 ps) and the fact that 
there was no equilibration after this process cause the final temperatures to 
be only labels for the eleven aforementioned sets of conformations. They are, 
by no means, the thermodynamical temperatures of any equilibrium state from 
which the structures are taken. This sets of conformations are only meant to 
sample the representative regions of the phase space. In Fig. El arbitrarily 
chosen structures from three particular sets are shown together with the native 
conformation. The average radius of gyration (Rg) of each set, depicted in the 

16 The c27b4 version of the CHARMM program was used. The molecular dynamics were 
performed using the Leap Frog algorithm therein implemented and the param22 parameter set, 
which is optimized for proteins and nucleic acids. The water was taken into account implicitly 
with the [13] version of the Generalized Born Model built into the program. 
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same figure, must be compared to the radius of gyration of the native state 17 . 
The average RMSD of the structures in each set with respect to the native 
structure, calculated via the quaternion-based method described in [49], is also 
presented 18 . 

The van der Waals energy implemented in CHARMM may be expressed as 
follows: 



(EiEj) 



1/2 




(21) 



where the sum is extended to all the pairs of atoms and the free parameters 
Si and Ri only depend on the type of atom (i.e., two atoms i and j of the same 
type have the same parameters assigned). 

Using the working set of conformations described above of the Trp-Cage 
protein, the robustness of this potential energy function to changes in the free 
parameters Ec and Rc associated to the aliphatic sp3 carbon CH (denoted by 
CT1 in CHARMM) is investigated. To do this, a finite grid-like set of points 
(eq, R c ) is chosen in the bi-dimcnsional parameter space, with e c ranging from 
—0.10 kcal/mol to —0.02 kcal/mol and R c ranging from 2 A to 4 A. Then, for 
each point in this set, different values Sec are added to and subtracted from e c , 
or different values SRc are added to and subtracted from R c independently. 
The potential that corresponds to Ec = £% — <5 £ c is denoted by V\ , the one that 
corresponds to Ec = e c + Sec is denoted by V2 (analogously with Rc) and the 
distance d(Vi,V2) between the two instances is computed in each case (i.e., for 
each central point (e c , R c ) and for each Sec (or SRc)) 19 - 

This procedure allows us to study the dependence of the distance between V\ 
and V2 on the corresponding difference, Sec or SRc, between this two potentials 
in the parameter space for each central point (e c , R c )- This relation may be re- 
garded as one between indctcrmination in the values of the free parameters and 
its influence on the conformational behavior of the system. From this point of 
view, the difference Sec (or SRc) in the parameter space for which the distance 
associated equals RT (see Sec. must be considered an amount of indetermi- 
nation in the parameters that does not involve relevant physical changes in the 
system. Therefore, if the parameters are known to a precision equal or greater 
than the one associated to these particular values of Sec or SRc , the statistical 
indetermination of the parameters may be regarded as harmless. The values of 
this differences (as a function of the central point (e c ,R c )) computed for the 
system studied in this section are depicted in Fig. \7\ 

Although this study only pretends to be an illustration of the concepts in- 
troduced in the previous sections and more features of the van der Waals energy 

17 Both Rq and the RMSD have been computed taking into account only the a-carbons. 

18 The notation for this quantity, which is the root mean square deviation of the atomic 
coordinates of two structures after optimal superposition [49], is the same as the one used 
for the RMSD of the energies in Sec. |3 This choice has been made for consistency with the 
literature, in which this ambiguity is very common. 

19 It can be proved that, in this particular case, the normality hypothesis in cq. |5]is ap- 
proximately fulfilled. 
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a) 




Figure 7: Robustness of the van der Waals energy to changes in some free parameters. 
Relative indetermination in ec (a) and in Rc (b) associated to d(Vi,V2) = RT as 
a function of the central point in the parameter space. Larger values of the relative 
indetermination correspond to greater robustness. 

should be investigated elsewhere, some interesting remarks may be made about 
the results herein presented. One one hand, directly from Fig. one can see 
that the precision needed in Rc is much greater than the one needed in ec, i.e., 
the van der Waals energy is more sensitive to changes in Rc than in ec- This is 
reasonable because V depends on Rc raised to the 12th and 6th power whereas 
ec only enters the expression raised to 1/2 (see Eg. I21|l . On the other hand, 
the allowed indetermination in the parameters grows, in both cases, as Rc di- 
minishes (the dependence on ec is much weaker). The reason for this being 
probably that, when the van der Waals radius Rc is large enough, the atoms 
begin to clash, i.e., the 12th power in Eq. |^ associated to the steric repul- 
sion, begins to dominate over the 6th power term, associated to the attractive 
dispersion forces. 

Finally, we would like to mention that, for the values ec = —0.02 kcal/mol 
and R c = 2.275 A, which are the ones used in the CHARMM param22 param- 
eter file, the allowed indeterminations in the parameters are fec/l £ c| = 35 % 
and SRc/Rc = 3 %, in the region of relatively lower required precision (i.e., the 
relatively more favorable region). Note, however, that the indetermination for 
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Rc corresponds to ~ 0.07 A, which is a rather demanding accuracy and sug- 
gests that, if the van der Waals radii set is changed, the behavior of the system 
may be significantly modified. 
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Figure 8: Comparison between different levels of the theory in the quantum me- 
chanical ab initio study of the Potential Energy Surface (PES) associated with the 
Ramachandran angles of the model dipeptide HCO-L-Ala-NHb. The figure must be 
read as follows: 1) Any numerical set of measures is associated to the comparison 
between the level of the theory in the corresponding row (denoted by Vi) and column 
(denoted by V2). 2) The conformations scatter plot that belongs to a particular set 
of measures is the one that lies in the position which is obtained via reflection (of the 
set) with respect to the blank diagonal. 

As a second brief example of the possible applications of the method, we 
present an exploratory comparison of different levels of theory in the quantum 
mechanical ab initio study of the Potential Energy Surface (PES) associated 
with the Ramachandran angles of the model dipeptide HCO-L-Ala-NH 2 . This 
comparison is an example of the first point discussed in Sec. El 

In [4], the PES of HCO-L-Ala-NH 2 is calculated with two methods, RHF and 
B3LYP, using, for each one, three different basis sets, 3-21G, 6-31+G(d) and 
6-311+-t-G(d,p). To do this, the Ramachandran space is divided in a 12x12 grid 
and, fixing the values of the $ and ^ torsional angles, a geometrical optimization 
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of the structure is performed at each point. This process produces the values 
of six different instances of the same potential energy on a working set of 144 
conformations of the system. 

In Fig. EI each one of the six levels of the theory is compared to the other five 
(using the data provided by A. Perczel) and some relevant numerical measures 
are presented. The distance d\ 2 is given in units of RT (at 300 K), the fitted 
energy reference shift a\ 2 and slope b\ 2 are also shown and the only quantity 
that requires further explanation is N ICS (see Eg. l2"3l below ). 

One of the interests in studying PESs of peptide models lies on the possibility 
of using the results for modeling short oligo-peptides or even proteins [20] . If we 
imagine that we use the PES of HCO-L-Ala-NH2 for constructing a potential 
that describes the behavior of a peptide formed by TV alanine residues, the first 
naive attempt would be to simply add N times the potential energy surface of 
the individual HCO-L-Ala-NH2 (making each term suitably depend on different 
pairs of Ramachandran angles). We may now ask whether the distance between 
two different instances of the iV-residue peptide potential can be related to 
the distance between the corresponding mono-residue ones. It can be proved, 
appropriately choosing the working set of conformations of the larger system 
and using the relations presented in Sec.|Sl that the following relation holds: 



where we have denoted by di 2 (N) the distance between the V\ and V 2 po- 
tentials of the TV-residue peptide. 

Hence, we define iV res as the TV for which di2(N) = RT, representing up to 
which number of residues the criterium given in Sec. [S] will be satisfied: 



Although a much more exhaustive study will be carried out in future works, 
let us extract some meaningful conclusions from the data in Fig. [H] Note, first, 
that the only two cases for which d 12 < RT are RHF/6-3I+G(d) vs. RHF/6- 
311++G(d,p) and B3LYP/6-31+G(d) vs. B3LYP/6-311++G(d,p). This means 
that the convergence in basis is achieved for both methods somewhere between 
6-31+G(d) and 6-311 ++G(d,p) and suggests (for HCO-L-Ala-NH 2 ) that there 
is no need in going above 6-31+G(d). Of course, the fact that N ICS — 22, in 
the B3LYP case, and N xes ~ 12, in the RHF case, places a limit on the size 
of the system for which the similarity of the two levels should be considered 
as sufficient. Finally, note that the distance between RHF/6-311++G(d,p) and 
B3LYP/6-311++G(d,p) is 1.78 RT, which means that the convergence in meth- 
ods has not been achieved and some more accurate method should be studied. 

10 Conclusions 

In this work, a measure d(Vi, V 2 ) of the differences between two instances of the 
same potential energy has been defined and the following points about it have 



d 12 (N) = VNd 12 (l) , 



(22) 




(23) 
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been discussed: 

• It rests on hypothesis whose validity stems from general characteristics 
shared by many complex systems and from the statistical laws of large 
numbers. We believe that, without knowing specific details of the system, 
the statistical approach is unavoidable and, among the many criteria, our 
distance is the most meaningful way of quantifying the differences between 
potentials. 

• It allows to make physically meaningful statements about the way in which 
the energy differences between conformations change (or how the energetic 
ordering of the conformations is altered) upon substitution of one potential 
by the other. 

• It may be applied to at least three practical situations characterized by 
the origin of the differences between the potentials. 

— Different algorithms or approximations are used (potential design). 

— The potential energy function depends on free parameters and the 
two instances correspond to different values of them (robustness). 

— Slightly different systems are compared (mutational studies, effective 
potentials). 

• It presents advantages over the commonly used quantities RMSD, ER, 
SDER and AER that consist mainly of not overestimating irrelevant trans- 
formations on the potentials, such as adding an energy reference or rescal- 
ing one of them. Regarding the Pearson's correlation coefficient r, our 
distance may be considered as a physically meaningful way of giving him 
energy units. Finally, the numerical complexity involved in the calculation 
of d(Vi, V2) (see below) is similar to the one associated to any of the other 
quantities. 

• It is approximately additive for most of the interesting situations encoun- 
tered in practical cases. 

In addition, a first practical example, which consists in the study of the 
robustness to changes in the free parameters of the van der Waals energy in 
CHARMM, and a second one, in which the ab initio potential energy surfaces 
of the HCO-L-Ala-NH2 molecule calculated at different levels of the theory are 
compared, have been presented to illustrate the concepts discussed. 

Finally, we would like to summarize the steps that must be followed to 
compute the distance in a practical case. Although all that follows has been 
said, we believe that a brief recipe could be useful for quick reference: 

1. Generate a working set of independent conformations {qi}fL 1 (sec Scc.|21 
and the last paragraph of SccOJ). 

2. Denote V* := Vi(qi), V 2 l : = ^2 (ft) an d compute the statistical quantities 
Hi, H2, ci and Cov(Vi, V 2 ) in Eq.|H 
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3. With them, calculate the mean-square estimators through Eq. [3J First 
612, then a\2 and, finally <j\2- 

4. If comparing a potential to a more accurate instance, use dyi = V2C12 to 
find the asymmetrical version of the distance between them, and rescale V2 
dividing it by 612 if desired. Otherwise, repeat the steps 2 and 3 changing 
1 <-> 2 in all the expressions to compute 021 and use Eq. 0to finally arrive 
to d(V u V 2 ). 

5. If d(Vi, V2) < RT (or d\ 2 < RT, depending on the case), the two potentials 
may be considered physically equivalent. 

Appendix A: Metric properties 

For completeness, and because, in the case of our distance, it is illustrative 
to do so, we will investigate, in the following, in which situations (which will 
turn out to be rather common) the behavior of d(V\,V2) approaches that of 
a traditional mathematical distance. However, it must be stressed that the 
measure introduced in this article was never intended to be such an object. 
Its meaning is encoded in the statistical statements derived from its value (see 
Sec. |3) and the name distance must be used in a more relaxed manner than the 
one traditionally found in mathematics. 

The object T>(x,y) is said to be a distance (also a metric) in mathematics if 
it satisfies the following properties: 

1. T>(x, y) = x = y 

2. Positivity: V(x,y) > 

3. Symmetry: T>(x,y) =T>(y,x) 

4. Triangle inequality: T>(x, z) < T>(x, y) + T>(y, z) 
Whereas, in the case of d(Vi, V 2 ): 

1. The first property is not fulfilled. One certainly has the implication to the 
left, but the direct implication is false in general. As has been stated in 
the Pt. [31 in Sec.0J the analogous property that d(Vi, V2) satisfies is that 
d(Vi, V2) = is equivalent to V 2 being a linear transformation of V\ and 
vice versa, i.e., to V^(<fi) = &i2Vi(<fi) + a\ 2 and V\{%) = &21V2O&) + 021, 
Vgi € {qi\iL\- Where, additionally, one has that 621 = I/612 and 021 = 
—012/612- The fact that this property of a mathematical distance is not 
satisfied by d(Vi,V2) must be considered an advantage, because, as has 
been remarked in previous sections, it is reasonable to regard as equivalent 
two potentials if there is only a linear transformation between them. 

2. d(Vi,V 2 ) > for every V x and V 2 . 
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3. Directly from its definition in Eq. [3] or from Eq. [H] it is evident that 
d(Vi, V 2 ) is symmetrical under change of V± by Vi- This property is not 
fulfilled by the quantity d\ 2 . However, the situation in which it is rea- 
sonable to use it (the comparison of a particular instance of a potential 
energy V to a less accurate one) is also intrinsically asymmetrical (see the 
final part of Sec.^J. 

4. The triangle inequality, in this context, is a relation that must be ex- 
pressed as a function of the statistical quantities related to three different 
potentials, Vi, V 2 and V3, as follows: 



1 - r? 3 < y/al + oiy/1 - r\ 2 + y^f + a\^\ - rj 3 . (24) 

This relation is not fulfilled for every triplet (Vi, V2, V3), i.e., the distance 
introduced in this article does not satisfy, in general, the triangle inequal- 
ity. A simple counterexample is found if one makes o 3 grow, keeping the 
rest of the quantities in Ea . 1241 constant . For 173 large enough, the relation 
above may be approximated by: 



03 



1 - r i 3 - v 1 - r h 



<Ja 2 1+ a%Jl-rt 2 . (25) 



Then, if rf 3 < r| 3 , one may make (73 even larger and eventually break 
the inequality (in the case that it were not already broken for the value 
of (T3 for which Eq. 1251 is a good approximation) . As a final remark, it 
is worth pointing out that, despite the general mathematical facts stated 
above, there is a particular situation, which is also expected to be similar 
to the situations relevant to be studied, in which the distance has been 
found to satisfy the triangle inequality If one has that o\ = 02 = °3 
(something that is expected to be approximately true in the case that the 
three potentials are proximate), Ea. 1241 turns into a relation involving only 
the correlation coefficients: 



l-r? 3 < V 1 -^2+ V 1 -^ • (26) 

In addition, assuming the hypothesis discussed in Sec. |2 the following 
inequalities can be proved [45] without any further assumptions about the 
potentials: 

r 13 > r 12 r 23 - yjl -r\ 2 ^\-r\ z , (27a) 

ri 3 < r 12 r 23 + ^ 1 - r\ 2 yfl - r 2 23 . (27b) 

We have numerically found that, if the relations in Eq. |^|are satisfied, 
so is the one in Eq. Hence, if a\ = a 2 =173, then, for all values of 
f\2, r 2 3 and ri3, the distance satisfies the triangle inequality. Clearly, for 
continuity, if a\ = 02 = 03 is not exactly but approximately satisfied. 
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then, although the triangle inequality may be broken, it will broken by a 
small relative amount. In short, if one has a\ = a% = 03 approximately, 
one has the triangle inequality also approximately. 
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